⚡️ Speed up function generate_id_within_group by 50%
#29
📄 50% (0.50x) speedup for generate_id_within_group in datacompy/core.py
⏱️ Runtime: 63.8 milliseconds → 42.6 milliseconds (best of 16 runs)
📝 Explanation and details
The optimization achieves a roughly 50% speedup (63.8 ms down to 42.6 ms) through three key performance improvements:
1. Reduced DataFrame Subsetting
Caches dataframe[join_columns] once as join_df instead of repeatedly accessing it.

2. Faster Null Detection
Replaces .isnull().any().any() with .isnull().to_numpy().any(). NumPy's .any() on boolean arrays is significantly faster than pandas' chained .any().any() operations.

3. Efficient Value Collision Check
Uses (values_array == default_value).any() instead of DataFrame equality checking.

Impact on Hot Path Usage:
The function is called within _dataframe_merge() when duplicate rows are detected (self._any_dupes is True), creating temporary order columns for unique matching. Since this happens during DataFrame merging, a core functionality of the datacompy library, these optimizations should improve performance for datasets with duplicate rows.

Test Case Performance Patterns:
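As a rough illustration of the three changes described in the explanation above, the following sketch applies them to a simplified stand-in function. It is not the code from datacompy/core.py: the function name generate_id_within_group_sketch, the sentinel value "DATACOMPY_NULL", and the fillna-then-groupby structure are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel used to stand in for nulls; the real value in
# datacompy may differ.
DEFAULT_VALUE = "DATACOMPY_NULL"


def generate_id_within_group_sketch(dataframe, join_columns):
    """Number the rows within each group of identical join-column values.

    A simplified sketch, not the library's implementation.
    """
    # 1. Subset the join columns once and reuse the result.
    join_df = dataframe[join_columns]

    # 2. Null detection on the underlying boolean NumPy array, instead of
    #    the chained pandas reduction .isnull().any().any().
    if join_df.isnull().to_numpy().any():
        # 3. Collision check on a NumPy object array. dtype=object keeps the
        #    comparison elementwise even when the columns are numeric.
        values_array = join_df.to_numpy(dtype=object)
        if np.any(values_array == DEFAULT_VALUE):
            raise ValueError(f"{DEFAULT_VALUE} was found in your join columns")
        # Replace nulls with the sentinel so they form their own group.
        join_df = join_df.fillna(DEFAULT_VALUE)

    # cumcount() yields 0, 1, 2, ... for successive rows of each group.
    return join_df.groupby(join_columns).cumcount()


# Example with duplicate keys and nulls in a join column.
df = pd.DataFrame(
    {
        "a": ["x", "x", None, None, "y"],
        "b": ["1", "1", "2", "2", "3"],
        "value": [10, 20, 30, 40, 50],
    }
)
ids = generate_id_within_group_sketch(df, ["a", "b"])
```

In this sketch the duplicate rows sharing ("x", "1") receive the IDs 0 and 1, which is what lets a later merge match duplicates one-to-one. Whether the same gains appear on a given dataset depends on its size and dtypes, so the timings quoted in this PR should be treated as specific to its benchmark.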
✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
test_core.py::test_generate_id_within_group
test_core.py::test_generate_id_within_group_valueerror
🌀 Generated Regression Tests and Runtime
⏪ Replay Tests and Runtime
To edit these changes, run git checkout codeflash/optimize-generate_id_within_group-mi5tvkj8 and push.